STAT 450: Case Studies in Statistics

Keegan Korthauer

2025-01-08

Course details

  • Class meetings: Wed/Fri 9:00 - 10:30
  • Labs: Fridays 15:00 - 16:00
  • Office hours: by appointment
  • Instructors:
    • Keegan Korthauer
    • Rodolfo Lourenzutti
    • Melissa Lee
  • Teaching Assistant: Gian Carlo Di-Luvi
  • Writing Assistant: Estella Qi
  • Course Email: stat450@stat.ubc.ca

Today

  • Course syllabus and course details
  • Privacy and security
  • Intro to Art of Data Science
  • Jupyter Notebooks

Canvas

Announcements, course schedule, lecture notes, assignments, and grade center will be maintained through Canvas

It is important that you check this page regularly!

Evaluation

Due dates on assignments are posted on Canvas

  • Homework: 5%
  • Lecture attendance: 5%
  • Labs: 10%
  • Group project (real case study): 80%
    • Team Work Contract: 5%
    • Group Proposal: 5%
    • Client Interaction: 10%
    • Group Written 1st Report: 10%
    • Individual in-class questions about project: 5%
    • Midterm Peer Assessment on Group Work: 2.5%
    • Group Written 2nd Draft: 10%
    • Group Final Report: 10%
    • Group Oral Presentation: 10%
    • Group Poster Session: 10%
    • Final Peer Assessment on Group Work: 2.5%

Textbook

What is Statistics?

  • Statistics is the study of the collection, organization, analysis, interpretation and presentation of data (Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP)
  • Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty (Davidian and Louis, DOI: 10.1126/science.1218685)
  • Data: values of qualitative or quantitative variables that are collected within a particular setting and carry information and knowledge about that setting.

Why Statistics?

Why R?

(And why can’t I just use Excel???)

Recipe Analogy

  • Using a programming language is like baking with a recipe:

    • Ingredients = data

    • Recipe = code

Recipe Results

Someone else can use your recipe (code) to bake the same cake (produce the same data analyses)
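The recipe analogy can be made concrete with a small sketch. It is written in Python purely for illustration (the course itself uses R, where the same idea applies), and the data values, the `bake` function, and the seed are all made up for this example. The point is only that running the same code on the same data, with a fixed random seed, gives the same results every time:

```python
import random
import statistics


def bake(ingredients, seed=450):
    """Recipe (code): summarize the data, with a bootstrap standard error."""
    rng = random.Random(seed)  # fixed seed, so the resampling is repeatable
    n = len(ingredients)
    # Resample the data with replacement 200 times and record each mean.
    boot_means = [
        statistics.mean(rng.choices(ingredients, k=n))
        for _ in range(200)
    ]
    return {
        "mean": statistics.mean(ingredients),
        "boot_se": statistics.stdev(boot_means),
    }


data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7]  # ingredients: hypothetical measurements

first = bake(data)   # you bake the cake
second = bake(data)  # someone else follows the same recipe

print(first == second)  # prints True: same recipe, same ingredients, same cake
```

A point-and-click workflow leaves no such record of the steps taken, which makes the same check much harder.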

What is the goal of STAT450?

Statistics is the study of the collection, organization, analysis, interpretation and presentation of data (Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP).

  • Students of STAT450 will develop skills to:
    • understand how the data was collected and its implications in the subsequent analysis
    • organize the data in a way that can be analyzed
    • analyze the data using appropriate statistical methods to answer the client’s question(s)
    • interpret the results
    • present and communicate the results

How will this be achieved?

  • Most course activities will be organized around a case study and reading assignments
  • Students will participate in:
    • interactive class discussions
    • formulation of statistical approaches to solve research problems
    • data exploration, model building and statistical inference
    • written and oral presentations of results

Be prepared for a non-typical course!

Communication

Real data, real work, real challenge!!

You will work throughout the course on a real CASE STUDY!

Goal:

  • Promote collaborative and interdisciplinary work

  • Give you the opportunity and experience to work on a real case with a real client (something to include in your job portfolio!!)

  • Improve analytical, computational, and communication skills

Project: a real client/collaborator

  • Case studies will be posted soon – you will get to select your top choices
  • Based on class discussions, we will form groups and assign a project to each group (3-4 students)
  • 3 meetings with the client during class time (organized by students):
    • 1st meeting: introduction, face-to-face questions about the project, data, and goals
    • 2nd meeting: preliminary results, check if the client has further questions
    • 3rd meeting (Poster Session): final results to the client

Project: group work

Students will work in groups on one assigned case

  • You will conduct a statistical analysis and write it up as a transparent and reproducible report in R Markdown
  • Group oral presentation, final group report and poster presentation

Some projects from previous years

STAT 450 Poster session 2024

STAT 450 Poster session 2023

STAT 450 Poster session 2022

Lectures

  • Statistical topics (some known, some new) using a case study
    • Work in groups on activities
    • Class discussions
    • R-coding
    • Some data analysis
  • Communication skills (oral and written)
    • Guidelines to improve communication skills
    • Individual presentations
    • Group oral presentation
    • Class discussion
  • Client’s project
    • Discussion time during some lectures

Labs

  • Labs are mandatory - participation worth 10% of grade
    • TA will go over key concepts needed for the course
    • Labs start this week!
    • Lab time is a great time to work with your group with the assistance of the TA and/or designated instructor

Computing/tools

  • Students are expected to be familiar with R and RStudio
    • Some guidelines to R-coding will be covered in lectures and labs
    • Most labs will be focused on the use of R to solve problems discussed in class or related to the case studies
    • If you have never used R before, you may need to supplement the lectures and labs with some self-guided learning
  • We will use GitHub for collaboration
    • The first few labs will cover important tools used in the course: RStudio/R Markdown/GitHub
  • We will use Slack for communication with your project groups - check Canvas for the link (expires February 5)!

Academic Concession

Deadline for all assignments is 11:59 pm (Pacific time) on the due date

  • Any submission or modification after the due date will not be graded unless you have requested an extension
  • If you anticipate having trouble meeting a deadline and need an academic concession, please reach out in advance via email to the instructors

If you miss class, we suggest that you:

  • Consult the class resources on Canvas
  • Use the class Slack workspace to discuss missed material with classmates
  • Visit office hours
  • Seek academic concessions, if applicable

Academic Misconduct

Plagiarism occurs where an individual submits or presents the oral or written work of another person or generative Artificial Intelligence (AI) tool as their own.

  • When words (i.e. phrases, sentences, or paragraphs), ideas, or entire works are taken from elsewhere, their source must be acknowledged. Failure to provide proper attribution is plagiarism.
  • If you choose to use generative AI tools to complete coursework, you must disclose your use of them. This disclosure must be included at the top of the submission file for the assignment in which the generative AI tool was used. The disclosure should include the name of the tool and a brief description of how it was used.
    • No client data or specific details of client’s study can be shared with generative AI tools (privacy/intellectual property risk)
    • You are responsible for verifying the accuracy of, and correctly attributing, any information you take from generative AI output

Academic Integrity (from Learning Commons UBC)

  • Creating and expressing your own original ideas
  • Engaging with the ideas of others
  • Explicitly acknowledging the sources of your knowledge (accurate citation practices)
  • Completing assignments independently or acknowledging collaboration when appropriate
  • Accurately reporting the results of your research
  • Taking exams without cheating

Tips from Learning Commons UBC

When reviewing your work, ask yourself:

  • Is the idea or argument presented mine?

  • Are the words my own?

  • Can my work be clearly distinguished from the work of others?

Visit the Learning Commons’ guide to academic integrity for support in understanding how to prevent unintentional plagiarism.

Research Ethics (homework)

As many of the projects will involve research that was conducted on humans, everyone will be required to complete an online training module on the Ethical Conduct for Research Involving Humans (see “Homework 0”):

  • HW0 has already been released and is due Jan 13th
    • A self-paced online Ethics course – please note this may take several hours to complete, so don’t leave it to the last minute!

Data Privacy

You will be working with data from researchers in the community. It is important that these data are stored securely on your (encrypted) computer!

  • do not upload data to GitHub repositories (servers are in the US)

  • no analyses run on public computers

  • no data in Google Drive/Slack, and be careful about using Google Docs for reports (consult with client)
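One way to guard against accidentally committing data is to check file names before pushing. The sketch below is a hypothetical helper, not an official course tool, written in Python for illustration; the extension list and the file names are assumptions to adapt to your project. In practice, listing your data folder in the repository's `.gitignore` file is the usual first line of defence.

```python
from pathlib import PurePath

# Assumed set of extensions that usually indicate data files; adjust per project.
DATA_EXTENSIONS = {".csv", ".tsv", ".xlsx", ".rds", ".rdata", ".sav"}


def flag_data_files(paths):
    """Return the paths whose extension suggests they contain data."""
    return [p for p in paths if PurePath(p).suffix.lower() in DATA_EXTENSIONS]


# Hypothetical list of files about to be committed.
staged = ["report.Rmd", "data/survey.csv", "R/clean.R", "data/clinic.RData"]

print(flag_data_files(staged))
# prints ['data/survey.csv', 'data/clinic.RData']
```

A check like this only looks at file names, so it does not catch data pasted into a report or script.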

What is sensitive data?

For some of the projects you may be working with sensitive data

  • Some researchers collect data under restricted legal data sharing agreements
  • Data may contain personal information and/or identifiable information

The researcher has the right to keep their data private and secure, so make sure to encrypt your computer and follow privacy and information security best practices:

Keep personal information and client data secure

This includes GitHub and Slack

From The Art of Data Science

  • “Data analysis is hard, and part of the problem is that few people can explain how to do it”
  • Analogy between a data analyst and a songwriter:
  • knowledge + theory + practice + art

Data analysis versus conducting a study

(from The Art of Data Science)

  • “data analysis presumes the data have already been collected”

  • “a study includes the development of a hypothesis or question, the designing of the data collection process (or study protocol), the collection of the data, and the analysis and interpretation of the data”

  • IMPORTANT note: a data analyst needs to understand (or ask about) the data collection process!!

Epicycles of Analysis

(Figure 1 from The Art of Data Science)

Matching expectations to Data

(from The Art of Data Science)

  • “One key indicator of how well your data analysis is going is how easy or difficult it is to match the data you collected to your original expectations”
  • The first 2 meetings with our clients will help you:
    • collect the information you need to evaluate/revise expectations

Summary

  • STAT450 is a non-typical course
  • We won’t give you a “formula” on how to perform a good data analysis
  • We will review important steps in the process
  • We will conduct a real data analysis:
    • Learn and set expectations
    • Collect information
    • Revise expectations
    • Communicate results!!